Mid-infrared cross-comb spectroscopy

Dual-comb spectroscopy has been proven beneficial in molecular characterization but remains challenging in the mid-infrared region due to difficulties in sources and efficient photodetection. Here we introduce cross-comb spectroscopy, in which a mid-infrared comb is upconverted via sum-frequency generation with a near-infrared comb of a shifted repetition rate and then interfered with a spectral extension of the near-infrared comb. We measure CO2 absorption around 4.25 µm with a 1-µm photodetector, exhibiting a 233-cm^-1 instantaneous bandwidth, 28000 comb lines, a single-shot signal-to-noise ratio of 167 and a figure of merit of 2.4 × 10^6 Hz^1/2. We show that cross-comb spectroscopy can have superior signal-to-noise ratio, sensitivity, dynamic range, and detection efficiency compared to other dual-comb-based methods and mitigate the limits of the excitation background and detector saturation. This approach offers an adaptable and powerful spectroscopic method outside the well-developed near-IR region and opens new avenues to high-performance frequency-comb-based sensing with wavelength flexibility.


Setup and optical spectra
The setup diagram is depicted in Supplementary Fig. 1, and the optical spectra are depicted in Supplementary Fig. 2. The supercontinuum, used as the readout FC after a bandpass filter, is taken from a monitor port (tap output) of the PCF pumped by the local FC in its f-to-2f module. The SFG signal and readout FC are mixed in a commercial 2x2 50:50 wideband fiber-optic coupler (Thorlabs TW1064R5A2A), before which the two beams are coupled from free space into fiber by commercial fiber collimators. The configuration of the fiber coupler is further illustrated in Supplementary Fig. 3a. For the SFG signal, two commercial free-space shortwave-pass filters (Spectrogon SP-1300) are used to block residual NIR local power and MIR target power. For the readout FC, the bandpass filter (Delta Optical Thin Film A/S LF104008) for the supercontinuum is centered at 1140 nm with a full-width at half-maximum (FWHM) bandwidth of 15 nm. The NIR detector is a commercial InGaAs fiber-coupled balanced detector (Thorlabs PDB415C).
The target FC is provided by a chain of two cascaded half-harmonic OPOs [1]. Pumped by a commercial mode-locked Yb:fiber laser centered at 1.045 µm, the first half-harmonic OPO generates 2.09-µm pulses, which are then used to pump the second half-harmonic OPO at 4.18 µm. Half-harmonic OPOs feature intrinsic phase and frequency locking of their output to the pump [2]; thus the phase and frequency of the 4.18-µm OPO are intrinsically locked to those of the 1.045-µm pump. Hence, by locking the repetition rate (and CEO frequency) of the 1.045-µm laser to that of the 1.55-µm Er:fiber laser (local FC), the target FC (4.18-µm OPO) is locked to the local FC. In this experiment, the repetition rates and CEO frequencies of the local FC and of the 1.045-µm Yb:fiber laser (which pumps the target FC), as well as all measurement apparatus, are locked to a 10-MHz RF rubidium (Rb) clock, ensuring a common frequency standard.
Supplementary Fig. 2| Optical spectra of frequency combs used in the experiment. a. Target FC for both "without sample" (purged) and "with sample" (unpurged) cases, measured by a commercial Fourier-transform infrared spectrometer (FTIR) with a resolution of 4 cm^-1. Residual CO2 that cannot be fully cleared by purging is the reason why the absorption dip can still be observed in the "without sample" curve. b. Local FC spectrum provided by the manufacturer (Menlo Systems). c. Spectra of SFG FCs (with and without sample) and readout FC measured by a grating-based OSA with a resolution of 0.5 nm.

Temporal gating
For the same reason as in EOS [3], our cross-comb method also benefits from temporal gating (also referred to as "nonlinear gating"), although our local pulse is not as short as that used in EOS. However, this effect cannot be seen in the measurement shown in Fig. 3 of the main paper because the balanced detection conceals the strong background. Corresponding to band A and band D of the RF FC (see Section 2.4), the background is a common-mode signal that originates only from the SFG FC port and is thus cancelled by the balanced detection. Note that the background at the center-burst is essentially an intensity cross-correlation of the target pulses and local pulses, so, unlike in DCS, it depends on the delay (lab time) (see Fig. 1 of the main paper).
As shown in Supplementary Fig. 3, if we tweak the coupling of the splitter to the balanced detector (panel (a)) such that it is not well balanced, the strong background shows up prominently at the center-burst (panel (b)). However, because of the temporal gating, the beating at the tail, which contains the useful information, is free from any undesirable background power from the strong pulse center. A complete description of temporal gating and its importance is more involved and can be found in Sections 3 and 4.2.
Note that the balanced detection can only "conceal" the background in its balanced RF output; it cannot solve the problem caused by the strong background. Although a well-balanced detector can cancel the common-mode signal and noise in its balanced RF output by comparing the outputs of two photodiodes, there may still be strong optical power incident on each photodiode that is not visible in the balanced output. This strong incident optical power can bring in noise that is not common-mode (e.g., shot noise) and thus cannot be cancelled (indeed, it adds up), and it will ultimately saturate the photodiodes. The same problem arises in the detection of weak FID signals in DCS. More illustrations can be found in our simulation (Section 5).

Supplementary Fig. 3| [...] TIA: transimpedance amplifier. b. Interferograms measured when the detector is not well balanced. The main figure presents the central 14-µs part of one example interferogram in order to highlight the details of the center-burst (inset Ⅰ) and the tail (inset Ⅱ). Note that the measurement is done when the detector is tuned to be just slightly unbalanced. The background at the center-burst is actually very strong and can heavily saturate the TIA if the detector is further unbalanced. Note that the result shown here is from an older measurement where local pulses with lower power are used; thus, its FID signal is lower than that in Fig. 3 of the main paper.

For the reasons mentioned in the main paper, we are currently only able to acquire ~0.5 s of data with a repetition-rate detuning of 1 kHz, which gives an SNR sum of 8.03×10^5 (the highest blue point). To estimate a figure of merit (FOM), we need to scale this number in two ways. Firstly, we assume data acquisition over a full second, which gives twice as many interferograms. Secondly, the signal currently occupies only 28 MHz of the available 125-MHz RF span (half the repetition rate of the local/target FC). If we could set the detuning to ~4.5 kHz (we did not do that in the experiment due to some limitations of our detector and locking electronics), we would obtain ~4.5 times more interferograms. In total, we could realistically obtain 9 times as many interferograms in our measurement, which leads to an estimated FOM of 2.4×10^6 Hz^1/2 (yellow dashed line and purple triangle).
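The scaling argument above can be checked with a few lines of arithmetic. The sketch below assumes that the SNR grows as the square root of the number of coherently averaged interferograms (white-noise-limited averaging); that assumption and all variable names are illustrative and are not taken from the text, while the numerical inputs are the values quoted above.

```python
import math

# Values quoted in the text
snr_sum_measured = 8.03e5       # SNR sum obtained from ~0.5 s of data
acquisition_time_s = 0.5        # approximate acquisition time
rf_span_used_mhz = 28.0         # RF span currently occupied by the signal
rf_span_available_mhz = 125.0   # half the repetition rate

# Factor 1: a full second of data gives twice as many interferograms
time_factor = 1.0 / acquisition_time_s

# Factor 2: a larger detuning that fills the available RF span (~4.5x)
detuning_factor = rf_span_available_mhz / rf_span_used_mhz

interferogram_factor = time_factor * detuning_factor  # close to 9

# Assumption (not stated in the text): SNR scales as sqrt(number of interferograms)
fom_estimate = snr_sum_measured * math.sqrt(interferogram_factor)

print(f"interferogram factor ~ {interferogram_factor:.2f}")
print(f"estimated FOM ~ {fom_estimate:.2e} Hz^1/2")
```

Under this assumption the estimate lands at roughly 2.4×10^6 Hz^1/2, in line with the number quoted in the text.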

Estimation of experimental signal-to-noise ratio (SNR) and figure of merit (FOM)
In addition, here we explain how we estimate the upconversion efficiency of our experiment and how it compares to that of prior work using C.W. upconversion [4]. In that work, a 700-µW MIR FC and a 2.4-W C.W. laser are used to generate a 1.3-µW upconverted NIR FC in a 20-mm PPLN crystal, and their upconversion efficiency follows from these values. Note that all powers in this subsection are average powers. In our experiment, we use 500-mW target pulses and 200-mW local pulses to generate an SFG signal of ~100 nW in a 1-mm PPLN crystal. However, in our case, the 100-fs local pulse scans through the 50-fs target pulse because of their different repetition periods (see Fig. 1e of the main paper), so the time interval in which the two pulses overlap (and SFG power is generated) accounts for only a very small portion of the full period. Specifically, with a repetition-rate detuning of 1 kHz and repetition rates of approximately 250 MHz for both the local and target FCs, the local pulse scans through the target pulse in a total of 2.5 × 10^5 steps with a step size of approximately 16 fs (meaning that the relative position of the local and target pulses changes by about 16 fs for each local pulse that samples the target) over the course of one interferogram. Since the target pulse is only around 50 fs long, the local pulse overlaps well with the target pulse in only 3 to 4 of those 2.5 × 10^5 steps, and the two pulses overlap perfectly in at most one step (the maximum of the interferogram). We can therefore estimate an effective overlap coefficient (the "duty cycle" of SFG generation) of about 10^-5, which must be factored into the calculation of the SFG efficiency for a fair comparison. With this factor included, our upconversion efficiency is more than two orders of magnitude higher than that of the C.W. upconversion work, although the average power of our local FC is just one tenth of theirs.
Note that we couple the generated SFG from free space into a single-mode fiber and then measure its power with a fiber-coupled OSA. However, there could be a large loss in the free-space-to-fiber coupling, which would lead to an underestimate of the SFG power. The efficiency could therefore be correspondingly underestimated.
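The numbers in this subsection can be reproduced with the short sketch below. It is only an illustration under stated assumptions: the efficiency is taken to be the output power divided by the product of the two input powers (in 1/W), with no normalization by crystal length, and the duty cycle is the rounded value of 10^-5 used in the text. The exact normalization used for the original comparison is not recoverable from the text, so the variable names and this definition are hypothetical.

```python
# Scan parameters quoted in the text
f_rep_hz = 250e6          # repetition rate of local and target FCs (approximate)
detuning_hz = 1e3         # repetition-rate detuning
target_pulse_s = 50e-15   # target pulse duration

n_steps = f_rep_hz / detuning_hz              # local pulses per interferogram
step_size_s = detuning_hz / f_rep_hz**2       # delay change per local pulse
overlap_steps = target_pulse_s / step_size_s  # steps with good pulse overlap
duty_cycle_estimate = overlap_steps / n_steps # of order 1e-5

# Assumed efficiency definition: P_out / (P_in1 * P_in2), in 1/W
eta_cw = 1.3e-6 / (700e-6 * 2.4)              # prior C.W. upconversion work

duty_cycle = 1e-5                             # rounded value used in the text
eta_cross_comb = 100e-9 / (0.5 * 0.2 * duty_cycle)

ratio = eta_cross_comb / eta_cw

print(f"steps per interferogram: {n_steps:.3g}")
print(f"step size: {step_size_s * 1e15:.1f} fs")
print(f"overlap steps: {overlap_steps:.1f}, duty cycle ~ {duty_cycle_estimate:.1e}")
print(f"efficiency ratio (cross-comb / C.W.) ~ {ratio:.0f}")
```

With these assumptions the ratio comes out above 100, consistent with the statement that the efficiency is more than two orders of magnitude higher.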

Phase correction and broadened absorption linewidth
Supplementary Fig. 5| Full measured absorbance spectrum of atmospheric CO2 and phase correction process. a. Full measured absorbance spectrum of atmospheric CO2 in our preliminary cross-comb measurement, including both P and R branches. The SNR of the R branch is lower than that of the P branch. Also, the absorption lines of the R branch are broader than those of the P branch, and the spacing between absorption lines in the R branch is smaller than that in the P branch. b. 3D interferogram of the raw data of the without-sample measurement. c. 3D interferogram of the corrected data of the without-sample measurement, around center-burst. d. 3D interferogram of the corrected data of the with-sample measurement, around center-burst. e. 3D interferogram of the corrected data of the with-sample measurement, around the first peak in the FID at a fast lab time interval of [4, 5] µs.
Supplementary Fig. 5a shows the full measured absorption spectrum, whose right side (higher optical frequency, R branch) has a worse SNR and more broadened absorption linewidth compared to its left-side counterpart (lower optical frequency, P branch). This is because the phase noise (uncontrolled broad relative linewidth) between the target FC and local (readout) FC has a larger effect on the R branch, as explained below.
Supplementary Fig. 5b-e show the phase correction for the interferograms, presented as "3D interferograms". In those 3D interferograms, each column is a single interferogram (detector voltage is denoted by the colormap), and consecutive single interferograms (columns) are plotted from left to right. Therefore, the vertical axis denotes the "fast time" within each single interferogram, and the horizontal axis denotes the "slow time", i.e., the time elapsed between interferograms.
Panel (b) shows the without-sample 3D interferogram with the fast time zoomed in to [-10,10] µs to show the center-burst.
Ideally, the center-bursts of all single interferograms should align perfectly at zero fast time, but they shift due to the phase noise and timing jitter between the two fiber lasers, since no tight locking is applied to them [5]. Two steps are taken to correct these shifts. Firstly, the maxima of the envelopes of the interferograms are shifted and aligned to correct the timing jitter (the envelope of each interferogram is obtained by Hilbert transform). Secondly, a zero-order phase term is applied to each interferogram to set its phase at the envelope maximum to zero, which corrects the zero-order phase shift between interferograms.
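The two correction steps can be sketched as follows. This is a minimal illustration rather than the processing code used for the measurement: it assumes integer-sample alignment with a circular shift, builds the analytic signal with an FFT-based Hilbert transform, and is exercised on synthetic interferograms. All function names and the synthetic parameters are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal of real-valued rows of x (FFT-based Hilbert transform)."""
    n = x.shape[-1]
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x, axis=-1) * weights, axis=-1)

def correct_interferograms(igms):
    """Align envelope maxima, then remove the zero-order phase at the maximum."""
    z = analytic_signal(igms)
    center = igms.shape[-1] // 2
    corrected = np.empty_like(z)
    for k, row in enumerate(z):
        peak = int(np.argmax(np.abs(row)))
        # Step 1: shift the envelope maximum to the center (timing jitter)
        shifted = np.roll(row, center - peak)
        # Step 2: set the phase at the envelope maximum to zero
        corrected[k] = shifted * np.exp(-1j * np.angle(shifted[center]))
    return corrected.real

# Synthetic test data: identical bursts with random timing and phase offsets
rng = np.random.default_rng(0)
t = np.arange(512)
shifts = rng.integers(-20, 21, 16)
phases = rng.uniform(-np.pi, np.pi, 16)
igms = np.array([
    np.exp(-((t - 256 - s) / 12.0) ** 2) * np.cos(0.8 * (t - 256 - s) + p)
    for s, p in zip(shifts, phases)
])

raw_avg = igms.mean(axis=0)
fixed_avg = correct_interferograms(igms).mean(axis=0)
print(np.abs(raw_avg).max(), np.abs(fixed_avg).max())
```

Averaging the uncorrected traces washes out the burst, whereas the corrected traces add coherently. A correction of this kind only uses information at the center-burst, which is why it cannot remove phase errors that accumulate at large fast times.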
The obtained results are shown in panels (c)-(e). At the center-burst of both the without-sample and with-sample cases (panels (c) and (d)), the corrected interferograms overlap very well and thus can be coherently averaged. This is largely because our correction uses the information from the sharp peak structure at the center-burst of the interferograms, which is generated by the interference of two short femtosecond pulses, and thus provides reasonably good correction over the whole center-burst. This is sufficient for the without-sample measurement since its information only exists around the center-burst. However, for the with-sample case, although its center-bursts are aligned well and can be averaged, prominent phase error still exists at larger fast times, for example, at the first strong FID peak (panel (e)). This is because the coherence time between the two combs is shorter than 4 µs, so the zero-order phase correction at the center-burst cannot fully correct the error at the more distant FID.
The larger the fast time (delay), the larger the phase error between different interferograms, and the less effectively they can be averaged.
In other words, the relative comb linewidth between our target and local (readout) combs remains broad because their repetition rates are independently locked only to an RF standard [5]. This explains why our measured absorption linewidth is larger than the theoretical value: although the center-burst of the with-sample measurement averages well, the FID signal at large delay does not. The effect is as if a window function were applied to the averaged with-sample interferogram in the time domain, which is equivalent to convolving its spectrum with a sinc function and thus broadens all its spectral features. The R branch of the absorption spectrum suffers more from this undesirable effect because its corresponding time-domain information lies at a larger time delay (fast time) owing to the smaller spacing of its absorption lines (see Supplementary Fig. 5a), which means a larger phase error and less effective averaging. A more detailed analysis of the temporal and spectral features of the CO2 absorption can be found in ref. 6. As mentioned in the main text, this issue can be solved by using an intermediate C.W. reference to provide information for either tight locking by fast actuators or error correction by data processing, as has been well demonstrated in dual-comb spectroscopy.
However, this effect does not influence the without-sample measurements since their information only exists close around the center-burst in the time-domain where phase error and timing jitter can be corrected based on the information from the sharp peak structure of the center-burst itself. Therefore, in terms of SNR, the obtained without-sample measurements after our phase correction are comparable to the results that can be obtained if the relative comb linewidth (phase noise and timing jitter) between target and local combs are ideally controlled. Hence, our estimate of the FOM of our spectrometer using the SNR of the without-sample measurements is fair.
One-to-one tooth mapping from target FC to RF FC

Target FC and Local FC
The electric field of the local FC can be described by

$$E_L(t) = \sum_m A^L_m \exp\left(-i 2\pi f^L_m t\right),$$

where $A^L_m$ denotes the complex amplitude that encodes both the intensity and phase of the $m$-th local comb tooth with optical frequency $f^L_m$, and the spatial dependence is omitted here. The superscript "L" of $A^L_m$ and $f^L_m$ denotes the local FC, and the subscript "$m$" corresponds to the $m$-th comb tooth. In addition, for the optical frequency $f^L_m$, we have

$$f^L_m = m f_{r,L} + f_{\mathrm{CEO},L},$$

where $f_{r,L}$ and $f_{\mathrm{CEO},L}$ are the repetition rate and carrier-envelope offset (CEO) frequency of the local FC, respectively. It is sometimes inconvenient to use "$m$" directly to index comb teeth, since the first tooth usually occurs at very large $m$. To be specific, for the first tooth of a practical frequency comb, $m \sim 10^6$. For convenience, we define the effective tooth index, $m'$, which starts at 1. If we use $m_0$ to denote the tooth index of the first local tooth, we have

$$m' = m - m_0 + 1.$$

Then, for any tooth ($m'$ starts from 1),

$$f^L_{m'} = (m' + m_0 - 1) f_{r,L} + f_{\mathrm{CEO},L}.$$

As for the local FC, we use $A^T_n$ to denote the complex amplitude of the $n$-th target comb tooth with optical frequency $f^T_n$. We have

$$f^T_n = n f_{r,T} + f_{\mathrm{CEO},T} = n \left(f_{r,L} + \delta\right) + f_{\mathrm{CEO},T},$$

where $\delta$ denotes the repetition-rate detuning between the local and target FCs, i.e., $f_{r,T} = f_{r,L} + \delta$.
Similarly, we can also define the effective tooth index for target comb teeth, $n' = n - n_0 + 1$, where $n_0$ is the tooth index of the first target tooth.

Supplementary Fig. 6| [...] indices are used to label each comb tooth in the plot. b. SFG FC and readout FC. Each SFG tooth is labeled by the effective tooth index ("1", "2", or "3") of the corresponding target tooth. The phase of the readout FC is assumed to be constant for each tooth and is thus not shown in the plot. The marked frequency differences denote the primary readout frequencies of the target teeth, labeled by effective tooth index. Each SFG group is labeled by its effective group index G'. c. RF FC. Every RF comb tooth in band B and band C is labeled with its corresponding target tooth.
Using the notation introduced above, the frequency-domain picture of cross-comb spectroscopy is depicted in Supplementary Fig. 6. For a concise and clear illustration, only three teeth are included for each FC, and simple random numbers, in arbitrary units, are assigned to their optical frequencies (Supplementary Fig. 6a). Note that generality is not lost by assigning $f_{\mathrm{CEO},L} = 0$, since in practice only the relative CEO frequency between the two FCs matters. Although only a small number of comb teeth and simple random numbers are used for the following illustrations and equations, the conclusions still hold when scaled to practical numbers.

SFG FC
Because of the slightly detuned repetition rates of the local and target FCs, each pair of teeth from them generates an SFG tooth at a unique frequency; the set of these teeth is referred to as the SFG FC. The electric field of a given SFG tooth can be described by (the phase-matching effect is not included here)

$$E^{\mathrm{SFG}}_{m,n}(t) = A^L_m A^T_n \exp\left(-i 2\pi f^{\mathrm{SFG}}_{m,n} t\right),$$

$$f^{\mathrm{SFG}}_{m,n} = f^L_m + f^T_n. \qquad (9)$$

As shown in Supplementary Fig. 6b, the resultant SFG comb teeth cluster into different frequency groups [7], which can be indexed by the group index $G = m + n$ (or effective group index $G' = m' + n'$), and the groups are evenly spaced by $f_{r,L} = 1$.
The group $G'$ is generated by the SFG between the ..., $(n'-1)$-th, $n'$-th, $(n'+1)$-th, ... target teeth and the ..., $(m'+1)$-th, $m'$-th, $(m'-1)$-th, ... local teeth, with $m' + n' = G'$. Note that the center group, with $G' = 4$, contains information about all the target teeth, in spite of the fact that different target teeth are modulated by different local teeth (see also Fig. 1 of the main paper). Such a group that contains the information for all target teeth is called a "complete (SFG) group" in the following. It is readily seen that the number of complete groups formed is determined by the number of local teeth relative to the number of target teeth.
More patterns can be observed within SFG groups. Firstly, SFG teeth in a single group are separated by the repetition-rate detuning $\delta$. Secondly, by mixing with different local teeth, a given target tooth generates multiple SFG teeth, which are all at the same relative frequency position in their respective SFG groups. To illustrate the second pattern, each SFG tooth in Supplementary Fig. 6b is labeled by its corresponding target tooth ("1", "2", or "3"). The pattern is made clearer still if readout teeth are introduced as frequency references (see next subsection). These patterns make it possible to do one-to-one mapping between the MIR and RF domains.

Readout FC
To read out the spectral information of the target FC contained in the SFG FC, another comb, referred to as the readout FC, is employed to beat with the SFG FC on a square-law photodetector. The readout FC is effectively a spectral extension of the local FC and therefore inherits its repetition rate $f_{r,L}$ and CEO frequency $f_{\mathrm{CEO},L}$. As shown in Supplementary Fig. 6b, readout comb teeth can be regarded as "boundary markers" for SFG groups, since both share the same constant spacing $f_{r,L}$ between adjacent units. For a given SFG group, we name its closest (second closest) readout tooth its "primary (secondary) readout tooth". For a given SFG tooth within an SFG group, we name the frequency difference between the tooth and its primary (secondary) readout tooth its "primary (secondary) readout frequency"; the sum of its primary and secondary readout frequencies is $f_{r,L}$. As shown in the illustration, the SFG teeth generated by the same target tooth always have the same primary readout frequency, even though they are distributed in different SFG groups and correspond to different primary readout teeth. Also, SFG teeth generated by different target teeth have different primary readout frequencies, as labeled in the illustration. These two patterns are very important and provide the foundation for the one-to-one mapping.
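As with the local and target FCs, we use $A^R_q$ to denote the complex amplitude of the $q$-th comb tooth of the readout FC.
Also, we can define the effective tooth index for readout comb teeth, $q' = q - q_0 + 1$, where $q_0$ is the tooth index of the first readout tooth. Note that $f_{r,R} = f_{r,L}$ and $f_{\mathrm{CEO},R} = f_{\mathrm{CEO},L}$.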

RF FC, one-to-one mapping, and absorption spectrum
Based on the SFG and readout comb teeth in the optical domain, one can calculate the resultant RF spectrum detected by a single square-law detector. The bandwidth of the detector is assumed to be "1" ($f_{r,L}$), which means that the highest RF frequency the detector can detect is the repetition rate of the local FC, $f_{r,L}$. This is a common condition in many dual-comb spectroscopy works. To calculate the RF signal (photocurrent) at a given RF frequency, one must sum the contributions from all the comb tooth pairs that can generate heterodyne beating at this frequency.
Each such pair contributes a term proportional to $A_1 A_2^{*}$, where $A_1$ and $A_2$ denote the complex amplitudes of the two comb teeth involved, which can be from the SFG or readout FC. The RF frequency of the beating signal, $f_{\mathrm{RF}}$, is equal to the difference between the optical frequencies of the two comb teeth involved.
As shown in the illustration, the RF FC comb teeth can be classified into four bands [8]. Band A consists of the intra-group beat notes, which are generated by two SFG teeth from the same SFG group. Band D is also composed of beat notes generated by two SFG teeth, but the two teeth are from two adjacent SFG groups. Note that the frequency component at $f_{\mathrm{RF}} = f_{r,L} = 1$ is a special component in band D which also includes the contribution from beating between two readout teeth. Band A and band D result only from the SFG FC (excluding $f_{\mathrm{RF}} = 1$) and correspond to the envelope of the SFG pulses (the cross-correlation signal between target and local FC) in the time domain, which does not contain much useful information for our purpose. In contrast, band B, consisting of beat notes between SFG teeth and their primary readout teeth, is a one-to-one mapping of the original target FC. As demonstrated in the equations, the complex amplitude of a given band B RF tooth is directly proportional to that of only one target tooth, although it is generally modulated by more than one local tooth and readout tooth. Like band B, band C is also a one-to-one mapping of the original target FC, resulting from beating between SFG teeth and their secondary readout teeth. Bands B and C contain exactly the same information about the target FC and are mirror images of each other, reflected about $f_{r,L}/2$ in the RF domain.
Based on the one-to-one mapping, the absorption spectrum in the MIR region interrogated by the target FC, including both amplitude and phase, can be obtained by comparing the RF band B (C) measured with the sample in the path and the corresponding result measured without the sample in the path (reference).
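The one-to-one character of band B can be illustrated numerically. In the sketch below, readout teeth are assumed to sit at integer multiples of the local repetition rate (local CEO frequency set to 0), and the detuning, relative CEO frequency and first-tooth indices are hypothetical arbitrary-unit values; the function names are likewise illustrative.

```python
F_REP_LOCAL = 1.0    # local/readout repetition rate (arbitrary units)
DELTA = 0.01         # hypothetical repetition-rate detuning
F_CEO_TARGET = 0.3   # hypothetical relative CEO frequency (local CEO = 0)

def primary_readout_frequency(f_sfg):
    """Distance from an SFG tooth to its nearest readout tooth."""
    offset = f_sfg % F_REP_LOCAL
    return min(offset, F_REP_LOCAL - offset)

def band_b_map(n_target=3, n_local=3, n0=5, m0=8):
    """Map each target tooth n' to the set of primary readout frequencies
    of all SFG teeth it generates with the different local teeth."""
    mapping = {}
    for n_eff in range(1, n_target + 1):
        f_target = (n0 + n_eff - 1) * (F_REP_LOCAL + DELTA) + F_CEO_TARGET
        freqs = set()
        for m_eff in range(1, n_local + 1):
            f_sfg = f_target + (m0 + m_eff - 1) * F_REP_LOCAL
            freqs.add(round(primary_readout_frequency(f_sfg), 9))
        mapping[n_eff] = freqs
    return mapping

mapping = band_b_map()
for n_eff, freqs in mapping.items():
    f_b = next(iter(freqs))
    print(f"target tooth {n_eff}: band B at {f_b:.2f}, band C at {F_REP_LOCAL - f_b:.2f}")
```

Each target tooth collapses to a single band B frequency regardless of which local tooth it mixed with, different target teeth land on different frequencies, and the band C frequencies are the mirror images about half the repetition rate.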

Universality
In our experiment, we use a MIR synchronously pumped degenerate OPO (centered at 4.18 µm) and an Er-doped fiber laser (centered at 1.55 µm) as the target FC and local FC, respectively. The readout FC is a band-pass filtered portion of a supercontinuum pumped by the local FC, which is generated in a photonic crystal fiber (PCF). It should be noted that the scheme of cross-comb spectroscopy (CCS) does not place any limitation on the laser techniques used for the frequency comb generation. However, the current implementation benefits from the intrinsic phase locking of the mid-IR comb to the Yb:fiber laser pump. Also, as a special case of CCS, the local FC or readout FC can be replaced by a "frequency comb" with only one tooth, i.e., a C.W. laser. This is explained in depth in the following section.
Moreover, in this derivation, we demonstrate the frequency-up-conversion one-to-one comb tooth mapping by SFG. In fact, it is also possible to realize one-to-one mapping by difference frequency generation (DFG), the derivation of which is very similar. This may be useful in the application of frequency-comb-based spectroscopy in the ultraviolet spectral range or for even shorter wavelengths.

Bandwidth requirements for Local FC and Readout FC
To realize one-to-one mapping for all teeth of the target FC, the local FC and readout FC must satisfy some requirements, which are discussed in detail in this subsection. For a concise discussion, we continue to use the simple illustration above, keeping the number of target teeth at three but varying the number of local teeth between 2, 3, and 4. The results are shown in Supplementary Fig. 7. N, M, and Q denote the number of teeth of the target, local, and readout FCs, respectively.
Supplementary Fig. 7| Bandwidth requirements for local FC and readout FC. N, M, and Q denote the number of comb teeth of the target, local, and readout FCs, respectively. a. N=M=3, Q must be >=1. The only complete SFG group, together with its primary readout tooth, is circled in red. b. N=3, M=2, Q must be >=2. Two incomplete SFG groups circled in red need to be read out by two readout teeth to map all three target teeth. c. N=3, M=4, Q must be >=1. Two complete SFG groups, together with their primary readout teeth, are circled in red.
As shown in panel (a), when M = N, there is only one complete group (circled in red) formed in the SFG FC, which alone contains the information from all target teeth. Thus, to read out all target information, at least one readout tooth is required (Q>=1), where the equality holds if and only if the readout tooth is the primary (or secondary) readout tooth of that complete group.
If we have one fewer local tooth (M=2, panel (b)), there is no complete group formed in the SFG FC, and at least two readout teeth are needed to read out all three target teeth (Q>=2). Similarly, to make the equality hold, the readout teeth need to be the primary (or secondary) readout teeth of the two center SFG groups, which are circled in red.
When there is one more local tooth than target teeth (M=4, panel (c)), there are two complete groups (circled in red) formed in the SFG FC. As in the case of M=3, one readout tooth is enough to read out all the target information (Q>=1). However, because more complete groups are available, the requirement on the location of the single readout tooth for the equality to hold is more relaxed than in the case of M=3. Here, it can be the primary (or secondary) readout tooth of either complete group.
This discussion can be generalized to any number of teeth, although the various cases are demonstrated here only with small numbers for simplicity. In short, to realize the one-to-one mapping of all target teeth, the aggregate bandwidth of the local and readout FCs needs to be equal to or greater than that of the target FC, i.e., $M + Q \ge N + 1$. Note that there are two trade-offs behind this equation:
a. The trade-off between the local tooth number and readout tooth location. If there are more local teeth, the location (frequency) of the readout teeth can be more flexible since more complete groups are formed. Conversely, the requirement on the readout tooth location is stricter if there are fewer local teeth. In practice, it is generally much more difficult to control the frequency of the readout teeth with the precision of the repetition rate than to obtain more local/readout teeth. Therefore, the general practical solution could be to make the aggregate bandwidth of the local and readout FCs moderately larger than that of the target FC and to control the frequency of the readout comb only roughly (e.g., with a precision of 0.1 nm). This is what we do in the experiment.
b. The trade-off between the number of teeth of the local FC and of the readout FC. As the equation suggests, fewer readout teeth are needed if there are more local teeth, and vice versa. It should be noted that, although in theory only the sum of the bandwidths of the local FC and readout FC is constrained for the one-to-one mapping of the target teeth, a relatively broad local FC (short local pulse) is more beneficial in practice, as it provides better temporal gating (Sections 1.2 and 4.2) and a higher upconversion efficiency (Section 4.1).
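The tooth-counting rule can be checked by brute force in the toy model. The sketch below assumes, as in the three cases discussed above, that the readout FC consists of Q adjacent teeth and that each readout tooth reads out one SFG group (as its primary or secondary readout tooth); the function name and this counting model are illustrative simplifications rather than part of the original derivation.

```python
def min_readout_teeth(n_target, n_local):
    """Smallest number Q of adjacent SFG groups (one readout tooth each)
    whose teeth together cover every target tooth."""
    all_targets = set(range(1, n_target + 1))
    first_group, last_group = 2, n_target + n_local   # G' = n' + m'
    for q in range(1, last_group - first_group + 2):
        for start in range(first_group, last_group - q + 2):
            covered = set()
            for g in range(start, start + q):
                # Group g contains target tooth n' if m' = g - n' is a valid local tooth
                covered |= {n for n in all_targets if 1 <= g - n <= n_local}
            if covered == all_targets:
                return q
    raise ValueError("no readout window covers all target teeth")

for n_target, n_local in [(3, 3), (3, 2), (3, 4), (5, 2)]:
    q_min = min_readout_teeth(n_target, n_local)
    print(f"N={n_target}, M={n_local}: Q >= {q_min}, "
          f"M + Q >= N + 1 is {n_local + q_min >= n_target + 1}")
```

In this model the minimum Q equals max(1, N - M + 1), which reproduces the three cases of Supplementary Fig. 7 and agrees with the aggregate-bandwidth condition.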

Bandwidth requirements for repetition rates and carrier-envelope offset (CEO) frequencies
In the last subsection, we discussed the bandwidth requirements on the optical side. In this subsection, we instead discuss the requirements on the RF side, specifically on $f_{r,L}$, $f_{r,T}$, $f_{\mathrm{CEO},L}$, and $f_{\mathrm{CEO},T}$. Without loss of generality, we continue to assume that $f_{\mathrm{CEO},L} = 0$; thus, $f_{\mathrm{CEO},T}$ is effectively the relative CEO frequency between the target FC and local FC. To quantify the requirements, we define two important parameters (see the illustration in Supplementary Fig. 8):
a. The spectral (frequency) distance from the first tooth of an SFG group to its primary readout tooth, denoted by D. Note that the "first tooth of an SFG group" refers to the SFG tooth that corresponds to the first target tooth (the tooth with minimum frequency in the target FC). In the definition of D, mod(A, B) denotes the remainder after division of dividend A by divisor B, and $n_0 f_{r,T} + f_{\mathrm{CEO},T}$ is the optical frequency of the first tooth of the target FC.
b. The spectral width of one complete group, denoted by W, which is set by the optical bandwidth of the target FC.
Additionally, to realize a one-to-one mapping, two kinds of spectral overlap need to be avoided:
a. Overlap between band A (D) and band B (C).
b. Overlap between band B and band C.
Similar to dual-comb spectroscopy (DCS), the repetition-rate detuning needs to be small enough to provide enough bandwidth in the RF domain, i.e., to satisfy requirement b. In addition, the relative CEO frequency also needs to be chosen carefully to satisfy requirement a, which is different from DCS.
Note that the above bandwidth requirements apply when a single detector is used for heterodyne photodetection. When an ideal balanced detector is used, the requirements reduce to a single condition, namely that band B and band C do not overlap. This is because band A and band D are eliminated by the balanced detector, since they are common-mode signals from the SFG FC. In other words, the balanced detector can double the bandwidth available for RF band B (C) at an unchanged $f_{r,L}$, which makes the RF bandwidth requirement effectively the same as in general dual-comb spectroscopy.
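The requirements above can be explored numerically. The sketch below is one possible reading, not the original equations: it assumes D is the first target tooth frequency modulo the local repetition rate, W is the number of target teeth times the detuning, band A spans 0 to W, and band B spans D to D + W, so that the conditions become W < D (single detector only) and D + W < f_r,L / 2. The function names and the example values of D are hypothetical; the bandwidth, repetition rate, and detunings are those quoted in the text.

```python
def group_width_hz(target_bandwidth_hz, f_rep_target_hz, detuning_hz):
    """Assumed W: (number of target teeth) x (repetition-rate detuning)."""
    return target_bandwidth_hz / f_rep_target_hz * abs(detuning_hz)

def first_tooth_offset(n0, f_rep_target, f_ceo_target, f_rep_local):
    """Assumed D: first target tooth frequency modulo the local repetition rate."""
    return (n0 * f_rep_target + f_ceo_target) % f_rep_local

def mapping_is_clean(d_hz, w_hz, f_rep_local_hz, balanced=False):
    """Check the assumed no-overlap conditions for bands A/B and B/C."""
    bands_b_c_separate = d_hz + w_hz < f_rep_local_hz / 2
    if balanced:
        # Bands A and D are cancelled, so only the B/C condition remains
        return bands_b_c_separate
    return w_hz < d_hz and bands_b_c_separate

C_CM_PER_S = 2.998e10
bandwidth_hz = 233 * C_CM_PER_S          # 233 cm^-1 instantaneous bandwidth
f_rep_hz = 250e6

w_1khz = group_width_hz(bandwidth_hz, f_rep_hz, 1e3)     # ~28 MHz
w_4khz = group_width_hz(bandwidth_hz, f_rep_hz, 4.4e3)   # ~123 MHz

print(f"W at 1 kHz detuning: {w_1khz / 1e6:.1f} MHz")
print("single detector, D = 40 MHz:", mapping_is_clean(40e6, w_1khz, f_rep_hz))
print("single detector, 4.4 kHz:", mapping_is_clean(1e6, w_4khz, f_rep_hz))
print("balanced detector, 4.4 kHz:", mapping_is_clean(1e6, w_4khz, f_rep_hz, balanced=True))
```

Under these assumptions a single detector limits W to below a quarter of the repetition rate, while an ideal balanced detector allows nearly half of it, which matches the statement that balanced detection doubles the usable RF bandwidth.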

Comparison of principle between different techniques
In this section, we compare DCS, C.W. upconversion spectroscopy, electro-optic sampling (EOS), and cross-comb spectroscopy (CCS) (Supplementary Fig. 9) using simple mathematical descriptions. We then demonstrate that C.W. upconversion and EOS are essentially two special cases of the cross-comb method; the former uses a very narrow-band local "FC" with only one "comb tooth", and the latter uses a very broadband local FC (very short local pulse) which also functions as the readout FC. We describe them in both the time domain and the frequency domain. In particular, we show that CCS in a general configuration can utilize the optical bandwidth more efficiently than EOS. In all these techniques, if the full electric field profile of the readout FC (local FC) is available, generally acquired by field-resolved measurements (e.g., FROG), the electric field of the target FC can also be reconstructed from the measured correlation signal. This extra information could be helpful in some ways; however, it is not necessary for the goal of general absorption spectroscopy.

Supplementary Fig. 9| Simplified schematics of different techniques.
Note that generally balanced detectors are used, which are simplified to be single detectors in the schematics. Also, there may be additional equipment before the detector, which is also omitted here; for example, an ellipsometry setup for electro-optic sampling (e).
To begin with, let us review the cross-correlation theorem: Or equally: where ( ) and ( ) denote the Fourier transforms of ( ) and ℎ( ), respectively.
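As a quick numerical illustration of the theorem, the sketch below checks its discrete, circular form on two short complex sequences. The names g, h, dft, and circular_cross_correlation are our own labels, and the sign and conjugation convention (conjugate on the first function, shift on the second) is an assumption that may differ from the convention used in the equations of this section.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(N^2)); adequate for short sequences."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]


def circular_cross_correlation(g, h):
    """c[s] = sum_k conj(g[k]) * h[(k + s) mod N]."""
    n = len(g)
    return [sum(g[k].conjugate() * h[(k + s) % n] for k in range(n))
            for s in range(n)]


# Two arbitrary (hypothetical) complex test sequences.
g = [complex(1, 0), complex(2, -1), complex(0, 3), complex(-1, 1)]
h = [complex(0, 1), complex(1, 1), complex(2, 0), complex(-2, -1)]

# Left-hand side: Fourier transform of the cross-correlation.
lhs = dft(circular_cross_correlation(g, h))

# Right-hand side: conj(G) * H, bin by bin.
G, H = dft(g), dft(h)
rhs = [Gm.conjugate() * Hm for Gm, Hm in zip(G, H)]

max_error = max(abs(a - b) for a, b in zip(lhs, rhs))
print(max_error)
```

The two sides agree to within floating-point rounding, which is the property used throughout this section: a delay-domain correlation measurement corresponds to a product of one spectrum with the complex conjugate of the other.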

DCS
Firstly, for DCS with a symmetric (collinear) configuration (Supplementary Fig. 9(a)) 9 , let us assume the sample's spectral response is ( ) , including both spectral amplitude | ( )| and spectral phase ( ( )). If we use ( ) and ′( ) to denote the electric field of a pulse before and after passing the sample, we have: (26) Therefore, for the cross-correlation signal ′( ), measured when both the target pulse and the readout pulse pass the sample: By comparing the two measurements (with and without the sample), we have: where ( ) denotes the comparison between those two measurements. This shows that the measurement can only provide the spectral intensity of the sample's response and lacks the phase information.
In fact, a symmetric DCS measurement is essentially a traditional FTIR (Michelson interferometer), which gives information only about spectral intensity but not spectral phase. Therefore, one cannot get any temporal information on the target pulses which are disturbed by the sample. In other words, the correlation signal ( ) is independent of the spectral phase of ′ ( ), which is cancelled as the readout pulse also passes the sample.
Secondly, for DCS with an asymmetric (dispersive) configuration (Supplementary Fig. 9(b)), note that the readout pulse does not pass the sample before being combined with the target pulse. In this case, the measured ( ) is dependent on the phase of ( ); thus, one can get phase information of the sample response.
However, one still cannot recover the full electric field of the target pulse, ( ) (or ( )) only by measurement of ( ) (or ′( )), in which ( ) (or ( )) is modulated by * ( ). This is because * ( ) is generally unknown unless some other field-resolved measurements (e.g., FROG) are applied to measure it. Nonetheless, general absorption spectroscopy does not require the full knowledge of ( ), since what we need to measure is ( ) rather than ( ), assuming ( ) does not change for measurements with and without the sample.
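The difference between the two DCS configurations can be made concrete with a small sketch. It assumes that the measured cross-spectrum is the target spectrum multiplied by the complex conjugate of the readout spectrum (as stated by the cross-correlation theorem), and it uses three hypothetical spectral bins; E_t, E_r, and H are our own labels for the target spectrum, readout spectrum, and sample response.

```python
import cmath


def cross_spectrum(target, readout):
    """Bin-by-bin product of the target spectrum and the conjugated readout spectrum."""
    return [t * r.conjugate() for t, r in zip(target, readout)]


# Hypothetical complex spectra (three bins) and sample response H = |H| exp(i*phi).
E_t = [1 + 0.5j, 0.8 - 0.2j, 0.3 + 0.9j]
E_r = [0.7 - 0.1j, 1.0 + 0.4j, 0.5 - 0.5j]
H = [cmath.rect(0.9, 0.3), cmath.rect(0.5, -1.0), cmath.rect(0.98, 0.05)]

# Reference measurement without the sample.
reference = cross_spectrum(E_t, E_r)

# Symmetric (collinear): both pulses pass the sample.
target_after = [e * h for e, h in zip(E_t, H)]
readout_after = [e * h for e, h in zip(E_r, H)]
symmetric = cross_spectrum(target_after, readout_after)

# Asymmetric (dispersive): only the target pulse passes the sample.
asymmetric = cross_spectrum(target_after, E_r)

ratio_symmetric = [s / r for s, r in zip(symmetric, reference)]
ratio_asymmetric = [s / r for s, r in zip(asymmetric, reference)]

print(ratio_symmetric)   # real-valued |H|^2 in every bin: the phase is lost
print(ratio_asymmetric)  # complex H in every bin: amplitude and phase retained
```

In both cases the unknown readout spectrum cancels in the ratio, which is why absorption spectroscopy does not require a field-resolved characterization of the readout pulse.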

CCS
Thirdly, let us discuss CCS, which has the additional step of frequency conversion (Supplementary Fig. 9(c)).
Step 1: nonlinear upconversion where ( ) denotes the electric field of the local FC (pulse). Note that this equation is an approximation that rests on several assumptions, chiefly the slowly varying envelope approximation (SVEA), weak nonlinearity, a lossless and dispersionless medium, unaffected (undepleted) input beams, and ideal phase matching. In addition, for clarity, we omit all proportionality constants since we are mainly interested in the shape of the pulses/spectra. These conventions are the same as those generally used in the ultrafast pulse measurement community 10 , where more details can be found.
Step 2: linear readout (same as asymmetric DCS) Like asymmetric DCS, one can get phase information of the sample response, but ( ) cannot be fully recovered since ( ) is modulated by * ( ) in ( ). However, this does not impede the measurement of the absorption spectrum ( ).

C.W. upconversion and EOS
Both C.W. upconversion and EOS can be shown to be special cases of the above CCS description. To describe C.W. upconversion (Supplementary Fig. 9(d)), nothing needs to be modified in the CCS equations, except that ( − ) denotes a continuous sinusoidal wave instead of a pulse. Also, it should be noted that using an SFG or a DFG process for nonlinear upconversion does not make a fundamental difference here; the equations are equivalent up to a complex conjugation. The illustration is shown in Supplementary Fig. 10(c).
EOS (Supplementary Fig. 9(e)) requires a more careful discussion. Let us start with the equations of CCS.
Step 1: nonlinear upconversion With this approximation, we can continue to derive the next readout step. Note that in EOS the role of readout pulse is played by the local pulse itself.
Step 2: linear readout where K denotes a constant equal to the integral ∫ ( − ) * ( − ) taken from −∞ to ∞, the value of which is independent of the delay . As shown in the equation, under this approximation, the correlation signal ( ) is equal to the electric field of the target pulse ( ) up to a constant. Thus, under the approximation of the ideal local pulse (infinitely short pulse width), one can obtain the full electric field of the target pulse ( ) in addition to the absorption spectrum ( ).
In practice, the finite pulse duration of the sampling pulse always imposes a frequency-dependent instrument response 11,12 , which is illustrated in Supplementary Fig. 10(d). In this case, the instrument response function * ( ) is the "autoconvolution" of the local spectrum.
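The ideal and nonideal EOS limits can be illustrated numerically. The sketch below assumes a real-valued hypothetical target waveform and a Gaussian local pulse whose intensity envelope acts as the sampling gate, so that the correlation signal is the integral of the target field weighted by the gate, divided by the constant K. All waveform parameters and function names are illustrative choices, not values from this work.

```python
import math


def target_field(t):
    """Hypothetical target waveform: Gaussian envelope times a carrier of period 4."""
    return math.exp(-(t / 5.0) ** 2) * math.cos(2.0 * math.pi * t / 4.0)


def gate_intensity(t, width):
    """Intensity envelope of the local pulse, taken here as a Gaussian."""
    return math.exp(-(t / width) ** 2)


def normalized_correlation(delay, width, steps=1200):
    """Riemann sum of target_field(t) * gate(t - delay) dt, divided by K = sum of gate dt."""
    half_span = 6.0 * width
    dt = 2.0 * half_span / steps
    signal = 0.0
    k_const = 0.0
    for i in range(steps + 1):
        t = delay - half_span + i * dt
        weight = gate_intensity(t - delay, width) * dt
        signal += target_field(t) * weight
        k_const += weight
    return signal / k_const


delays = [0.1 * k for k in range(-30, 31)]

# Gate much shorter than the carrier period: the signal follows the field.
error_short = max(abs(normalized_correlation(d, 0.05) - target_field(d)) for d in delays)

# Gate comparable to the carrier period: the field is visibly smoothed.
error_long = max(abs(normalized_correlation(d, 1.0) - target_field(d)) for d in delays)

print(error_short, error_long)
```

With the short gate the normalized correlation reproduces the target field closely, while the long gate attenuates the fast oscillations, which is the time-domain counterpart of a frequency-dependent instrument response.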

C.W. upconversion and EOS described by comb-teeth mapping
In the previous subsection, we have described C.W. upconversion spectroscopy and dual-comb EOS using the language of CCS without including comb teeth. In this subsection, we do the same thing factoring in comb teeth, following the derivation in Section 2.5. Note that Fig. 4 of the main paper is a good illustration for this subsection.
Based on what we derived for RF band B in Section 2.5, we can write the general formula for target tooth mapped in RF band B: where M denotes the total number of local teeth. Note that all the subscripts denote effective tooth index.
For the case of C.W. upconversion, there is only one "local tooth", so the formula is simplified to be = * (51) Everything can be described well by the language of CCS.
For the case of ideal EOS, let us review the approximation that we made in the time domain, which is: "In the span of ( ) ( ) (very short local/readout pulse), ( ) (target pulse) varies slowly, and thus can be approximated as constant." Correspondingly, in the frequency domain, we have an equivalent approximation: "In the span of ( ) (very narrowband, relatively), ( ) or ( ) (very broadband, relatively) varies slowly and can be approximated as constant." With this approximation, we have: where N denotes the total number of target teeth. Thus, we have: where K denotes a constant.
This result is equivalent to the equation ( ) = ℱ{ ( )} = ( ), which we derived in the last subsection in the time domain. Both results show that, in the limit of ideal EOS, the measured correlation signal (RF heterodyne beating) is equal to the electric field of the target pulse up to a constant.
The case of nonideal EOS is well demonstrated in reference [5].
In summary, both C.W. upconversion spectroscopy and EOS fall into the category of CCS, representing two opposite limits on the bandwidth of the local comb.

Comparison of performance between different techniques, by theoretical model
In the last section, we compared how different techniques work. With the same model and assumptions, in this section, we further compare some of their important metrics, including detection bandwidth, efficiency, SNR and dynamic range. We will first compare detection bandwidth and efficiency of different upconversion methods, pointing out that the short-pulse CCS is overall more efficient. Secondly, we will present a comparison between DCS and short-pulse CCS in terms of SNR and dynamic range, highlighting the effect of temporal gating 3 . Lastly, some insights into the design rules of CCS systems are provided, based on the results of this section.

Detection bandwidth and efficiency of upconversion methods
4.1.1 General CCS, symmetric CCS, and C.W. upconversion CCS
In the last section, we defined a response function ( ) to describe different methods. Here, we continue to use it for a more quantitative comparison. With the same mathematical assumptions as before (Supplementary Fig. 10), the dimensions of ( ) in the general CCS case are calculated from the given parameters (heights and widths) of ( ) and ( ) (Supplementary Fig. 11(a)).
In practice, it is more general and convenient to discuss and compare spectral intensity (power) instead of the spectral amplitude of the electric field. Thus, the amplitude spectrum is squared to give the intensity spectrum, and a power gain function ( ) = | * ( )|² is defined to describe the detection efficiency of the system (Supplementary Fig. 11(b)). Three metrics of ( ) are used to quantify it:
ℎ : the maximum value ("height") of ( ), which describes the highest detection gain at the center part of ( ).
: the bandwidth of (ω). Here we simply use zero points to define the width.
: the area under ( ), i.e., ∫ ( ) . This quantifies the total gain of the system.
Under our assumption of rectangular spectral profiles for ( ) and ( ), we get: where we assume ≤ . Please refer to Supplementary Fig. 11 and its caption for a detailed definition of variables. Note that ( ) , which denotes the area under ( ) ( ) , is equivalent to the average power of the local (readout) FC.
There are in total four effective free variables: , , , . For reasonable comparison later, let us assume a fixed total bandwidth = 2 = + and a fixed total power = 2 = + for the local and readout FC.
Let us first consider the choice of and . Based on the equation of ℎ , we have Thus, we want to make = = to optimize ℎ (this also optimizes ). In this case, = , which is a constant.
Secondly, let us consider the choice of and . To optimize ℎ , it is obvious that we want to make = = , thus we have ℎ = = .
As for , a more careful calculation is needed. Since is already set to a constant, let us consider function The conclusion we arrive at here has its roots in the same reason why people prefer short pulses over C.W. lasers for nonlinear optics: the much higher peak power of short pulses.

Symmetric CCS and EOS CCS
Here we compare short-pulse CCS with EOS CCS, the former of which is represented by symmetric CCS; the comparison is illustrated in Supplementary Fig. 13. As before, we keep their total bandwidth and power the same to make a fair comparison.

Supplementary Fig. 13| Gain function ( ) of symmetric CCS (a) and EOS CCS (b).
Note that the two panels are not on the exact same Y scale. As denoted in the figure, the heights of ( ) are ℎ , = and ℎ , = 4 , respectively (the latter is four times the former).
Based on the calculation shown in the plot, although the maximum of the EOS gain function is larger than that of the symmetric CCS gain function, the EOS maximum lies at ω = 0, which cannot overlap with the target spectrum at all. In fact, within the target spectrum, the maximum value of the EOS gain function is just 4/9 of that of the symmetric CCS gain function, and this value is reached only at the left edge of the target spectrum. If we instead compare the gain at the center of the two functions, the ratio becomes only 1/9. Since the two functions have different profiles, it is more reasonable to compare their overall gain (area): that of EOS CCS is only ~15% of that of symmetric CCS. In short, although EOS may require much more experimental effort, its overall detection efficiency can be much lower than that of symmetric (general short-pulse) CCS.
As shown clearly, the width of ( ), total bandwidth , increases monotonically with the local bandwidth .
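The ratios quoted above can be checked with a short numerical sketch. It assumes rectangular amplitude spectra and normalizes half of the total bandwidth and half of the total power to 1. The symmetric CCS response is then a triangle of base 2 centered on the target band, and the EOS response is the autocorrelation of a single rectangle of width 2. The placement of the target band in the upper third of the EOS response (between 4/3 and 2 in these units) is an assumption, chosen because it reproduces the 4/9, 1/9, and ~15% figures. The function names are hypothetical.

```python
def gain_symmetric(x, w=1.0, p=1.0):
    """Power gain of symmetric CCS: squared triangular response centered at x = w."""
    response = p * max(0.0, 1.0 - abs(x - w) / w)
    return response ** 2


def gain_eos(x, w=1.0, p=1.0):
    """Power gain of EOS CCS: squared autocorrelation of one rectangle of width 2w."""
    response = 2.0 * p * max(0.0, 1.0 - abs(x) / (2.0 * w))
    return response ** 2


def integrate(f, a, b, n=20000):
    """Midpoint-rule integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


peak_sym = gain_symmetric(1.0)

# Unrestricted maxima: EOS peaks at zero frequency, four times higher.
peak_ratio = gain_eos(0.0) / peak_sym

# Within the assumed target band [4/3, 2]: left edge and band center.
edge_ratio = gain_eos(4.0 / 3.0) / peak_sym
center_ratio = gain_eos(5.0 / 3.0) / peak_sym

# Overall gain (area) within the usable band versus the full symmetric response.
area_ratio = integrate(gain_eos, 4.0 / 3.0, 2.0) / integrate(gain_symmetric, 0.0, 2.0)

print(peak_ratio, edge_ratio, center_ratio, area_ratio)
```

Under these assumptions the area ratio evaluates to 4/27, about 0.148, consistent with the ~15% stated in the text.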

Comparison of General CCS, symmetric CCS, and C.W. upconversion CCS under a different assumption
Similarly, the area of ( ), total gain , also increases monotonically with the local bandwidth , the trend of which is shown in Supplementary Fig. 14(e).
The scaling of the height of ( ), highest gain ℎ , is slightly different. It is maximized when = (symmetric CCS), as we derived before in Section 4.1.1. However, this metric is less important than the other two.
In short, as the local FC gets broader in the frequency domain (shorter in the time domain), the overall performance of CCS increases. This agrees with our general understanding that shorter pulses lead to larger peak power, which benefits the efficiency of the nonlinear process.
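The trend summarized above can be reproduced with a closed-form sketch. Here the "different assumption" is taken to mean a fixed readout spectrum and a fixed local power that is spread over a variable local bandwidth; this reading is inferred from the remark in the caption of Supplementary Fig. 14 that the width-height product of the local spectrum is kept constant, and it may not match the original figure exactly. Spectra are rectangular, units are normalized, and the function name is hypothetical.

```python
def gain_metrics(local_width, readout_width=1.0, local_power=1.0, readout_height=1.0):
    """Height, zero-to-zero width, and area of the squared trapezoidal response."""
    local_height = local_power / local_width  # fixed power spread over the width
    overlap = min(local_width, readout_width)
    height = local_height * readout_height * overlap ** 2
    width = local_width + readout_width
    plateau = abs(readout_width - local_width)
    # Each squared linear ramp of base `overlap` contributes height * overlap / 3.
    area = height * (plateau + 2.0 * overlap / 3.0)
    return height, width, area


# From a C.W.-like local comb (0.01) to a local comb broader than the readout (2.0).
local_widths = [0.01, 0.25, 0.5, 1.0, 1.5, 2.0]
metrics = [gain_metrics(w) for w in local_widths]
heights = [m[0] for m in metrics]
widths = [m[1] for m in metrics]
areas = [m[2] for m in metrics]

for w, m in zip(local_widths, metrics):
    print(w, m)
```

In this model the total bandwidth and the total gain both grow with the local bandwidth, while the peak gain is largest when the local and readout bandwidths are equal.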

Supplementary Fig. 14| Gain function ( ) of symmetric CCS (b), general CCS (a, c), and C.W. upconversion CCS (d), and scaling with (e), under a different assumption.
In (d), the local FC spectral intensity profile is like a Dirac delta function, which has a very small width ′′ and a very large height ℎ′′ , while their product is kept the same as in the other three cases. In (e), note that the abscissa, local bandwidth , is normalized by , and the ordinate, total gain , is normalized by .

Summary
In this subsection we compare symmetric CCS, general short-pulse CCS, C.W. upconversion CCS, and EOS CCS. Compared to C.W. upconversion CCS, general short-pulse CCS can have a much higher detection (upconversion) efficiency, which comes from the enhanced peak power of short pulses over a C.W. laser. Compared to EOS CCS, general short-pulse CCS (represented by symmetric CCS) can have a much larger detection bandwidth and detection efficiency while being much less experimentally demanding. Overall, among different upconversion configurations, short-pulse CCS has advantages in bandwidth, efficiency, flexibility, and experimental complexity.

Overview
In this part, we will provide a quick qualitative description. In asymmetric DCS, we have the correlation signal: Note that only the cross term, i.e., the effective correlation signal, is kept in this equation. The background that is omitted in the equation is: This background is equal to the sum of the full power of the target pulse and local pulse, which is independent of the delay, . At large delay , when the weak tail (optical free induction decay) of the target pulse is being sampled by the local pulse, the effective correlation signal can be much smaller than the constant background. In other words, the extra noise incurred by the background from the strong target pulse can envelop the weak useful signal at the tail. Even in the absence of technical noise, the strong background can saturate the detector, thus fundamentally limiting the dynamic range and SNR of the measurement 3 .
In CCS, in which a short local pulse is used (not necessarily as short as in the EOS case), the correlation signal is: The omitted background terms are: In stark contrast to DCS, the background in CCS is dependent on the delay , as the target pulse is "temporally gated" by a short local pulse ( − ). At the weak tail of the target pulse, where the effective correlation signal is weak, the background is also very weak, as it is free from the strong power of the center (peak) part of the target pulse. This allows a much stronger target pulse to be used, which promises a higher SNR at the weak tail, compared to the linear DCS. This behavior is well shown qualitatively in Fig. 1(e) of the main paper.
It is readily seen that the temporal gating effect improves as the local pulse gets shorter. Also, a shorter local pulse benefits the upconversion efficiency. This is one of the reasons why we use a relatively short local pulse (broadband local FC) in our experiment, although in theory only the total bandwidth of the local FC and readout FC is constrained in order to fully map the target FC.

Assumptions of the model
To compare the SNR and sensitivity of DCS and short-pulse CCS quantitatively, we first need to set up our model (Supplementary Fig. 15).
(2) After the target pulse goes through the sample, we describe it by ( ) = ( ) + ( − ) exp( ) = ( ) + ( − ) (see panel (a)). The first term denotes the original probing pulse (center-burst), and the second term denotes the FID from the sample. denotes the amplitude ratio between the FID and the center-burst, which is ≪ 1 for a weak absorption measurement. denotes the time interval between the FID and the pulse center, and we assume it is ≫ (the pulse width of ( )), i.e., the FID signal is far enough from the center that the field amplitude there is not influenced by the term ( ). Admittedly, it is not physically sound to assume that the FID signal has the same pulse shape as the original excitation pulse. However, what matters for the following calculations is the amplitude ratio between the center-burst and the FID signal, and these assumptions simplify the math without changing the core of the calculation. According to the derivation in part Ⅱ of the supplementary material of ref. 3 , approximately equals the absorption up to a few other factors.
This assumption means the gating function effectively separates different temporal parts of the target pulse, and only the part that overlaps with the gating function can influence the detector value at a specific delay . As for the readout pulse, we assume it has the same envelope as the readout of DCS, although with a different optical frequency .
(5) For both DCS and CCS, we use a single slow detector that samples at a rate , the same as the repetition rate of the readout FC ( = , = , ≅ , , = ≫≫ ≫ ). Taking DCS as an example, at a delay , the detector current can be represented by: ( ) = ∫ | ( ) + ( − )| , where denotes a constant parameter that converts the result of the integration (equivalent to optical power) to photocurrent. Note that includes some physical constants related to the electric field as well as the quantum efficiency and responsivity of the detector, which are not main subjects of this study. The parameter, , and integration limits, and − , will be omitted to simplify the equation later, since the calculation is not generally sensitive to them.
(6) There are three kinds of noise that would be generally included in the SNR discussion: detector noise (NEP), shot noise, and relative intensity noise (RIN) 13 . For clarity, we do not include the RIN in this calculation. Therefore, unless the optical power is very low, shot noise is the main noise source here. Since we are going to apply the idea of "temporal gating", we study the SNR of the measurement in the time domain.
Let us first start with a typical ideal DCS measurement (no FID). We assume ( ) = ( ) = ( ) exp( ), and = ∫ ( ) . Please refer to Supplementary Fig. 15 (c) (the center-burst part) for illustration of the following discussion.
When | | ≫ , i.e., the two pulses do not overlap, we have a background signal: At = 0, when the two pulses constructively interfere, i.e., at the maximum of the interferogram, we have: At ≅ 0, when the two pulses destructively interfere, i.e., at the minimum of the interferogram, we have: Thus, the range of the interference here, denoted by , is 4 , which can be understood as the amplitude of the "signal". To evaluate the noise, we define the base current around = 0 as the average value of (0) and (0), which is: In this case, the base current around the maximum is equal to the background . Since shot noise here increases with , to optimize SNR, one wants to optimize the ratio / . In fact, this ratio is equivalent to the "interferometric visibility" up to a factor of 1/2.
The shot noise around = 0 can be expressed by: where denotes a constant parameter that converts √ into current noise. Like , includes some physical constants which are not the main subjects of this study.
If the optical power is low, the dominant noise is detector noise, and the SNR of the measurement is: The dominant noise becomes the shot noise when the optical power is higher, and SNR of the measurement is: Apparently, the SNR increases with , and this dependence agrees with general shot-noise-limited measurement since ∝ the power of the target (or readout) pulse. Also, the above two equations are equivalent to equation (24) and (25) Supplementary Fig. 15 (c).
At the center-burst, (0) = 4 = , and the SNR here reaches its maximum, as explained above.
However, at the FID, everything changes. We have: In short, although the amplitude of the signal (the interference) is times weaker than that at the center-burst, the noise level is still the same since the detector can see all the background from the center-burst, i.e., the large energy which contributes to only the noise here but not the signal.
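The DCS argument above can be summarized in a few lines of arithmetic. The sketch assumes equal target and readout pulses of normalized power p, an interference range of 4p and a base level of 2p at the center-burst, and shot noise proportional to the square root of the base level. The FID is modeled as a copy of the pulse with relative field amplitude a. The constant k_shot and all function names are our own placeholders.

```python
import math


def center_burst(p):
    """Interference range and base level when two equal pulses of power p overlap."""
    return 4.0 * p, 2.0 * p


def fid(p, a):
    """Same quantities when the readout overlaps an FID of relative field amplitude a.

    The detector still integrates the full center-burst power, so the base level
    stays close to 2p even though the interference range drops to 4*a*p.
    """
    return 4.0 * a * p, (2.0 + a * a) * p


def shot_limited_snr(signal_range, base, k_shot=1.0):
    """Shot noise is taken as k_shot * sqrt(base level)."""
    return signal_range / (k_shot * math.sqrt(base))


p, a = 1.0, 0.01
snr_center = shot_limited_snr(*center_burst(p))
snr_fid = shot_limited_snr(*fid(p, a))
snr_center_4p = shot_limited_snr(*center_burst(4.0 * p))

print(snr_center, snr_fid, snr_center_4p)
```

In this model the SNR at the FID is lower than at the center-burst by essentially the factor a, and quadrupling the power only doubles the shot-noise-limited SNR.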
Let us then consider CCS (see Supplementary Fig. 15 (d)). At the center-burst, since we assume This shows a √2 SNR enhancement of CCS over DCS, which is not significant. In fact, here we greatly limit the capability of CCS. On one hand, in this simple model we assume the same detector performance (NEP, responsivity, quantum efficiency, saturation level, etc.) for the NIR detector of CCS and the MIR detector of DCS, although the latter should be worse than the former.
On the other hand, we still limit the optical power, especially the local pulse, of the CCS, to avoid detector saturation at the center-burst, which is not necessary for detecting weak absorption.

SNR at FID, with temporal filtering
Now we introduce a temporal filter that discards the interference signal at the center-burst (before the FID) and keeps only the interference around the FID for the detection of weak absorption 3 . In this case, we no longer care about the detector saturation at the center-burst; the limit of the SNR at the FID is set locally by the detector saturation at the FID.

√2 ( − ) , to maximize the SNR here. See Supplementary Fig. 15 (e). We cannot increase more since ( ) already saturates the detector. We have: Compared to the non-gated case, the SNR can only be increased by a factor of √2. For CCS, thanks to temporal gating by the local pulse, the detector signal at the FID is free from the power of the center part of the target pulse (the second term on the RHS of the above equation of ( ), i.e., ∫ 2 ( ) ). Thus, we can increase the power of the local pulse by a factor of ( ) , i.e., ( ) = ( ). Then, at the FID (see Supplementary Fig. 15): Therefore, a √ SNR enhancement is demonstrated for CCS compared to DCS at the FID, assuming sufficient upconversion.
The weaker the absorption, the stronger the enhancement can be. The reason for this lies in the fact that the DCS signal at the FID always comes with a factor of 1/ stronger DC background which contributes only to the noise but not the signal, while CCS does not. This comparison is illustrated in Fig. 2d of the main paper.
In practice, one can "infinitely" decrease by decreasing the sample concentration or gas cell length. However, the enhancement ratio √ cannot be infinitely increased; the limit is set by two factors, whichever comes first:
1. The SFG efficiency. To fully reach the SNR enhancement, one needs to upconvert the target pulse by a factor 1/ stronger using local pulses with a higher peak power. However, this can be clamped by the highest available peak power of local pulses or the damage threshold of the SFG crystal.
2. The damage threshold of the NIR detector. Although the detector saturation at the center-burst no longer matters if we discard the signal there, we still do not want the power there to damage the detector. Generally, however, the detector is damaged by the average power rather than the peak power, and the average power on the detector mainly depends on that of the readout pulses instead of the local pulses. In other words, just a very small portion of the local pulse average power contributes to the total average power on the detector, the ratio of which is decided by the "duty cycle" of the SFG process (see Section 1.3). In temporal-filtered CCS, the strategy is to use stronger local pulses while keeping the readout pulses unchanged, which adds only a tiny optical average power on the detector. Hence, it is very unlikely that the limit of this factor comes earlier than the former.

Sensitivity and dynamic range, with temporal filtering
Following the SNR calculation, a comparison of sensitivity (the minimal detectable ) becomes straightforward. Let us define , the minimal detectable that makes SNR=1. For DCS at the FID, we have: For CCS, the sensitivity depends entirely on the upconversion capability, as discussed before. Let us assume the upconversion conversion ratio is 1/ . Then for an absorption , as derived before, we have an SNR = at the FID if we set ( ) = ( ). However, this SNR is much larger than 1 and more than enough to detect the absorption signal. Thus, we can further decrease the absorption. Let us assume an extra absorption factor to make the target FID a "small signal".
Simultaneously, to maximize SNR, we set ( ) ≅ 4 ( ). Then, at the FID, we have: Adding the ratio back, we have: Therefore, CCS can detect a smaller absorption compared to DCS. The detection is limited by the upconversion capability rather than the detector.
It should be noted that, in order to maximize the sensitivity (minimize the detectable ) for both techniques, we set their parameters ( ( ) and ( )) to make close to detector saturation. However, in such settings, there is no dynamic range left for the detection: a larger absorption will exceed the saturation limit, and a lower absorption will make < 1 and thus undetectable. In short, for a detection where good dynamic range is desired, we do not want to use the settings for the best sensitivity.
We continue to compare their dynamic ranges. In the interest of fairness, we used different power settings for each method to optimize their respective DR while keeping their sensitivity (minimal detectable absorption, the lower limit of the detectable range) the same. The comparison is illustrated in Fig. 2e.
In CCS, for an arbitrary low absorption , we keep powers of readout FC ( ( )) and SFG signal ( ( , )) the same and to be a quarter of the detector NEP. Note that SFG power can be tuned by either target power ( ( )) or local power ( ( )).
In such power setting, the interference range ( ) would be equal to the , making the absorption just detectable. Note that the dominant noise source in this scenario is detector noise (NEP) rather than the shot noise as readout power and SFG power are both tuned very low. Any stronger absorption would make ( , ) larger and more detectable until the detector is saturated. In other words, the interference range can vary from to , which means the DR of the absorption ( ) can utilize the full DR of the detector and we have = .
The square root here is simply because we use , a ratio of optical field, to quantify absorption, but detector current is proportional to optical power (square of the field).
In DCS, if we keep the readout power and target power the same as those of the CCS case, we will get the same interference range = , which makes the same small absorption detectable. However, unlike CCS, a large extra DC background will always be seen by the detector due to a lack of temporal gating. This power can occupy a large portion of the detector DR and therefore limits the range of the detectable absorption. Since this extra background is a factor of larger than the interference background in CCS, the dynamic range of DCS would be about a factor of smaller than that of CCS, i.e., = ( ) = .
In fact, this still underestimates the difference between them, since in DCS the dominant noise would become the shot noise at the same because of the large background. Therefore, an even larger target power or readout power has to be used to make the same absorption detectable, i.e., to make the interference larger than the total noise, which is now the sum of shot noise and detector noise. This explains why, for the same , in Fig. 2e of the main paper, the interference of the DCS looks larger than that of the CCS; it has to be made larger by a larger target or readout power to overcome a larger noise. This larger target or readout power will then occupy more . Moreover, the lower the is set to be, the more has to be occupied, as higher target or readout power has to be used in DCS to make the signal detectable, and the larger the difference between DCS and CCS will be.

SNR scaling with power of different combs
Here we provide a direct analysis of how the temporal SNR of the interferogram scales with the powers of the different combs, based on the same model and assumptions.
First of all, as the SFG process in our experiment and theoretical model is far from saturation or depletion, the SFG output power is expected to be linear in the target power or the local power, which is consistent with what we observed in the experiment.
Then the question becomes how the SNR scales with the SFG power, which depends on the relative power of the other arm of the interference, the readout power, and on the dominant noise source of the detection. This is equivalent to how the SNR of a typical interference (between two short pulses) scales with the power of one arm. We can use a simple equation to explain this: with which we have, Then there are a few different cases:
a. When SFG and readout power are both low and the dominant noise is the detector noise, the SNR scales linearly with the square root of SFG power or target/local power ( ∝ , = ).
b. When SFG power is high, readout power is low, and the dominant noise is the shot noise from SFG power, the SNR does not change with the SFG power or target/local power ( ∝ , ∝ ).
c. When SFG power is low, readout power is high, and the dominant noise is the noise from readout power, the SNR scales linearly with the square root of the SFG power or target/local power ( ∝ , ∝ ).
d. When both the SFG power and readout power are high and comparable, and the dominant noise is shot noise from both the SFG power and readout power, the SNR increases with the SFG power or target/local power ( ∝ , ∝ ( + )).
Note that in the above analyses, the roles of the SFG and readout FC are equivalent, and the roles of target power and local power are also equivalent. These analyses are basically equivalent to that of typical optical heterodyne detection.
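The four cases above can be checked against a simple heterodyne model. The sketch assumes that the interference signal is proportional to the square root of the product of the SFG and readout powers, and that detector noise and shot noise (proportional to the square root of the total power) add in quadrature. The quadrature sum, the normalized units, and the function names are assumptions made for illustration.

```python
import math


def heterodyne_snr(p_sfg, p_readout, detector_noise=1.0, k_shot=1.0):
    """SNR of the interference between an SFG arm and a readout arm (normalized units)."""
    signal = 4.0 * math.sqrt(p_sfg * p_readout)
    noise = math.sqrt(detector_noise ** 2 + k_shot ** 2 * (p_sfg + p_readout))
    return signal / noise


def log_slope(p_sfg, p_readout, detector_noise, step=1.01):
    """Local exponent n in SNR ~ p_sfg**n, estimated with a 1% step in SFG power."""
    ratio = (heterodyne_snr(p_sfg * step, p_readout, detector_noise)
             / heterodyne_snr(p_sfg, p_readout, detector_noise))
    return math.log(ratio) / math.log(step)


cases = {
    "a": log_slope(1e-4, 1e-4, 1.0),   # both powers low, detector noise dominates
    "b": log_slope(1e6, 1.0, 1e-3),    # SFG high, readout low, SFG shot noise dominates
    "c": log_slope(1.0, 1e6, 1e-3),    # SFG low, readout high, readout shot noise dominates
    "d": log_slope(1e6, 1e6, 1e-3),    # both high and comparable
}

for name, slope in cases.items():
    print(name, round(slope, 3))
```

An exponent near 0.5 corresponds to scaling with the square root of the SFG power (cases a and c), an exponent near 0 means the SNR no longer changes with the SFG power (case b), and case d gives an intermediate exponent of about 0.25 for equal powers, so the SNR still increases but more slowly.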

Summary
The discussion above demonstrates that the sensitivity of short-pulse CCS is limited by the upconversion capability (SFG efficiency), which is fundamentally different from DCS, where it is limited by detector saturation. The beauty of short-pulse-upconversion CCS is that strong local pulses can greatly enhance the peak power of the signal of interest in a localized temporal window, with minimal increase in the background signal and the average power on the detector, which add to the noise and saturation of the detector. This time gating effect endows CCS with advantages in SNR, sensitivity, and dynamic range.
Moreover, it should be noted that, among the three different upconversion configurations, only short-pulse-upconversion CCS can fully have these advantages. Firstly, C.W. upconversion CCS is basically DCS, which does not have the advantages we discussed here at all. Secondly, for EOS CCS, one may expect it to have the same advantages since it also uses short pulses for upconversion, but this is not true unless extra effort is made to independently control the power and spectrum of different spectral parts of the ultrashort pulses. Admittedly, the even higher peak power of the local pulse (because of the shorter pulse length) used in EOS CCS can provide even higher upconversion efficiency. However, when the average power of the local pulses is increased to detect weaker absorption, that of the readout part of the local spectrum is also increased, which can saturate the detector unexpectedly if they are not independently controlled. In other words, in short-pulse-upconversion CCS, one can always use higher local power to amplify a weaker FID signal while keeping the readout power unchanged, without saturating the detector. In EOS CCS, however, one cannot do the same, since the local and readout pulses come from the same pulse (spectrum) and thus their powers cannot be tuned independently.

Comparison of performance between different techniques, by simulation
In previous sections, we compared the principles and some performance metrics of different techniques using simplified theoretical models. In this section, we provide a more quantitative comparison by numerical simulation, to further demonstrate the advantages of CCS. We focus on the SNR and sensitivity of typical MIR DCS, MIR CCS without temporal filtering, and MIR CCS with temporal filtering, and on how they scale with the sample concentration (absorbance).

Simulation assumptions
Although the simulations are more quantitative and accurate, their assumptions remain consistent with those of the previous theoretical models. These assumptions are explained in detail below.

Frequency combs
For DCS, the comb powers are assumed to be high enough to saturate the MIR detector (~1 mW), which is practical, as many high-power MIR combs have been demonstrated in the past decade. For CCS, we likewise assume sufficient MIR power, local power, and nonlinearity, which are interchangeable as means of obtaining higher SFG power, so that the upconverted SFG power is high enough to saturate the NIR detector (also ~1 mW). This assumption is also practical considering our experimental results, recent progress in related areas, and state-of-the-art techniques as discussed in the main paper.

Sample
Here we use 1-meter-long CO2 of ambient level (~400 ppm) as the sample of unit concentration (relative concentration 10^0 in Supplementary Fig. 22), which is used in the simulation for the different techniques.
We model the CO2 response using the Lorentz oscillator model 14. Note that although we use target pulses centered at 4270 nm and CO2 absorption, this simulation can easily be adapted to other MIR wavelengths, which would not fundamentally change the conclusions of this section.
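As an illustration of the kind of response a Lorentz oscillator model produces, the following is a minimal sketch of a single Lorentzian line acting on an optical field over a 1-m path. It is not the parameterization used in our simulation: the function names, the linewidth, and the oscillator strength are hypothetical placeholders, and a dilute gas (refractive index close to 1) is assumed.

```python
import numpy as np

C = 299_792_458.0  # speed of light in vacuum, m/s

def lorentz_chi(omega, omega0, gamma, strength):
    """Susceptibility of a single Lorentz oscillator (exp(-i*omega*t) convention)."""
    return strength / (omega0**2 - omega**2 - 1j * gamma * omega)

def field_transmission(omega, chi, length_m):
    """Field transfer function of a dilute gas, using n ~ 1 + chi/2."""
    return np.exp(1j * omega / C * length_m * chi / 2.0)

# Hypothetical line parameters (placeholders, not fitted CO2 values).
omega0 = 2 * np.pi * C / 4.27e-6   # line center near 4270 nm
gamma = 2 * np.pi * 3e9            # assumed damping rate, rad/s
strength = 1e18                    # assumed oscillator strength, rad^2/s^2

# Angular-frequency grid spanning +/-30 GHz around the line center.
omega = omega0 + 2 * np.pi * np.linspace(-30e9, 30e9, 2001)

H = field_transmission(omega, lorentz_chi(omega, omega0, gamma, strength), 1.0)
absorbance = -2.0 * np.log(np.abs(H))  # power absorbance (natural log)
```

Multiplying a target-comb spectrum by `H` and transforming back to the time domain yields a pulse followed by a free-induction-decay tail, which is the feature exploited by temporal filtering.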

Detectors
Detectors are an important factor in the differences between MIR DCS and CCS. We choose an InGaAs detector as the NIR detector and a HgCdTe (MCT) detector as the MIR detector, each of which is a typical choice in its wavelength region.
While detectors from different manufacturers can have very different performance metrics, we adopt the specifications of two commercial detectors from Thorlabs, FPD510-FS-NIR (InGaAs) and PDAVJ10 (MCT), for the simulation, which represent the general metrics of these two kinds of detectors well. Note that we also refer to a review paper 16 for the D* of these detectors.
For the NIR InGaAs detector, we assume:

Noise
As before, we include detector noise and shot noise in our simulation. For the assumed NIR detector, the detector noise will dominate at low input powers, while shot noise will dominate at higher input powers. For the assumed MIR detector, however, even at detector saturation, the shot noise will be about the same order of magnitude as the detector noise, as shown by the simulation below, because of its low responsivity and high NEP. Therefore, the shot noise and detector noise are both important at high input powers for the MIR detector. Note that this detector difference is not included in our theoretical model above, making that model effectively less advantageous to CCS.
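The interplay between the two noise sources can be sketched with the textbook expressions for shot-noise current density, sqrt(2 q R P), and detector noise current density, NEP × R. The snippet below is only an illustration under these standard formulas. The responsivity and NEP values are hypothetical placeholders, not the datasheet values used in our simulation.

```python
import math

Q_E = 1.602176634e-19  # elementary charge, C

def shot_noise_density(power_w, responsivity_a_per_w):
    """Shot-noise current density in A/sqrt(Hz) for a given optical power."""
    return math.sqrt(2.0 * Q_E * responsivity_a_per_w * power_w)

def detector_noise_density(nep_w_per_rthz, responsivity_a_per_w):
    """Detector noise current density in A/sqrt(Hz) derived from the NEP."""
    return nep_w_per_rthz * responsivity_a_per_w

def crossover_power(nep_w_per_rthz, responsivity_a_per_w):
    """Optical power at which shot noise equals detector noise."""
    return nep_w_per_rthz**2 * responsivity_a_per_w / (2.0 * Q_E)

# Hypothetical detector parameters (placeholders only).
nir = {"resp": 0.9, "nep": 3e-12}    # A/W, W/sqrt(Hz)
mir = {"resp": 0.05, "nep": 50e-12}  # A/W, W/sqrt(Hz)
p_sat = 1e-3                         # assumed saturation power, W

for name, det in (("NIR", nir), ("MIR", mir)):
    shot = shot_noise_density(p_sat, det["resp"])
    dark = detector_noise_density(det["nep"], det["resp"])
    p_x = crossover_power(det["nep"], det["resp"])
    print(f"{name}: shot/detector at saturation = {shot / dark:.2f}, "
          f"crossover power = {p_x:.2e} W")
```

With placeholder values of this kind, the NIR detector becomes shot-noise-limited well below saturation, whereas for the MIR detector the two contributions remain of the same order even at saturation.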

Nonlinearity
For simplicity, we assume an ideal nonlinear conversion process in which the upconverted field is the product of the target and local fields (the SFG part), which is consistent with the theoretical model. The power efficiency is estimated by the standard SFG model under the assumption of quasi-C.W. operation 17,18. The nonlinear crystal is assumed to be lithium niobate. Although a more accurate model of the nonlinearity would give a better estimate of the upconversion efficiency and bandwidth, it would not affect most parts of this simulation, because there will generally be enough SFG power to saturate the detector. An accurate estimate of the upconversion efficiency is only important for estimating the limits of CCS with temporal filtering, as we explain later.

DCS
With all the parameters assumed, the DCS interferograms can be simulated, as presented in Supplementary Fig. 16. Note that the effective interference at the FID is clearly observable relative to the background (see (f)) and the noises (see (h)) only because we use a relatively strong absorption here; the signal could easily be overwhelmed by the noise if a much lower absorption were considered. By taking the Fourier transform of those temporal signals and noises, spectra can be obtained, which are shown in Supplementary Fig. 17. In panel (a), the spectral amplitudes of the ideal reference measurement, the ideal absorbed measurement, and the total noise (sum of the two kinds of noise) are presented, and their average amplitudes are displayed in the legend box.
The spectra are truncated to an interval whose endpoints are where the reference amplitude equals the noise amplitude. Note that the frequency axes are obtained from the direct Fourier transform and would need to be linearly mapped to real frequencies. This mapping is not applied in the figures, as it is not necessary for our purpose. The spectra of the two noises are presented individually in panel (b). As mentioned before, for this MIR detector, the detector noise is still close to the shot noise even when the detector is saturated in the reference measurement, so both are important.
The real "signal", corresponding to "noise", in the absorption measurement is neither the reference spectrum nor the absorbed spectrum, but the difference between them. The difference spectrum is depicted in panel (c), together with the spectra of the reference and the noise. The SNR of the measurement, defined as a spectral average, is the ratio of the average amplitude of the difference spectrum to that of the noise spectrum. The SNR increases with the absorption (sample concentration) as the difference amplitude increases with the absorption and approaches its upper limit. The upper limit, i.e., the max SNR, is the ratio of the average amplitude of the reference spectrum to that of the noise spectrum, because the difference amplitude cannot be larger than the reference amplitude. In other words, the difference spectrum approaches the reference spectrum when the absorption is very large, while the SNR approaches its maximum. On the other hand, lower absorption would result in a smaller

CCS (with temporal filtering)
This detection scheme is different from the two cases above, and its basic idea comes from a previous work on EOS 3. We cut the interferograms, both reference and absorbed, at a specific cutoff delay (> 0) and keep only the part after the cutoff for the detection of the sample. Since the center of the interferogram (around zero delay) is cut out, detector saturation there is acceptable, and a larger SFG power (obtained by tuning the target and/or local power) or readout power can be applied. In this example simulation, we keep the readout power the same and increase the target power tenfold and the local power fivefold compared to the case (CCS without temporal filtering) in the last section. Next, a cutoff delay of 0.5 ps is applied, the results of which in the time domain and frequency domain are presented in Supplementary Figs. 20 and 21, respectively.
In contrast to the two previous cases, here the absorbed amplitude is generally larger than the reference amplitude (Supplementary Fig. 21(a)).
Although the estimated SNR can be significantly higher than in the case without filtering, the signal obtained here is more useful for detecting the presence of the molecule than for recovering its complete fingerprint. In other words, as part of the information is excluded by temporal filtering, the original and complete absorption spectrum of the sample cannot be retrieved, at least not directly, as it can in general DCS and CCS. Clearly, different settings of the powers and the cutoff delay can lead to different SNR results, and here we only show one possibility. A complete and systematic discussion and optimization of them could be useful but involved, and is therefore beyond the scope of this section. Nevertheless, the choice of the cutoff delay is still worth additional discussion. While a larger cutoff delay can further decrease the reference amplitude (subtrahend), it also decreases the absorbed signal (minuend), since the FID signal generally decays exponentially in the time domain. In this example, the choice of 0.5 ps is, on the one hand, a balance between the temporal amplitudes of the reference signal and the absorbed signal. On the other hand, at this given setting of power and concentration, this choice of timing ensures that as much information as possible is preserved without saturating the detector. Moreover, though we highlight the FID signal from 14-34 ps throughout the paper, we do not use a cutoff delay near there in this simulation, for two reasons.
Firstly, despite the minimal residual reference signal there, the absorbed signal at such a large time delay is also weaker than that closer to zero delay (see Supplementary Fig. 21(e)-(h)). A cutoff delay that is too large, e.g., 10 ps, will eliminate a significant portion of the useful signal, for example, the signal from 0.5-4.25 ps as shown in panel (e). Secondly, and more importantly, such a distinct peak at such a large time delay is a unique feature of CO2 (a linear molecule) 6,19, which is special compared to more general molecules.
Therefore, by using a cutoff delay much closer to 0 than that unique FID peak, we demonstrate that this method works, and our related claims hold, for more general cases and do not rely on such special features, although our theoretical model (Fig. 1e, Fig. 2d-e, and Supplementary Fig. 14) assumes a picture more like this feature for clarity.
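The effect of the temporal filter can be illustrated with a toy interferogram. The sketch below is not our simulation: the pulse width, carrier frequency, FID amplitude, and decay time are arbitrary placeholders, and the function names are hypothetical. It only shows that, once the centerburst is cut out, the remaining spectral amplitude is dominated by the FID tail of the absorbed measurement.

```python
import numpy as np

def temporal_gate(signal, t, t_cut):
    """Keep only the part of the interferogram after the cutoff delay."""
    return np.where(t > t_cut, signal, 0.0)

def mean_spectral_amplitude(signal):
    """Average amplitude of the one-sided Fourier spectrum."""
    return np.mean(np.abs(np.fft.rfft(signal)))

# Toy time axis in ps and an arbitrary 5-THz carrier (placeholders).
t = np.linspace(-5.0, 40.0, 4501)
carrier = np.cos(2 * np.pi * 5.0 * t)

# Reference: short centerburst only (Gaussian envelope, sigma = 0.1 ps).
reference = np.exp(-t**2 / (2 * 0.1**2)) * carrier
# Absorbed: centerburst plus an exponentially decaying FID tail for t > 0.
fid_tail = -0.05 * np.exp(-t / 5.0) * (t > 0) * carrier
absorbed = reference + fid_tail

t_cut = 0.5  # cutoff delay in ps
ref_full = mean_spectral_amplitude(reference)
ref_gated = mean_spectral_amplitude(temporal_gate(reference, t, t_cut))
abs_gated = mean_spectral_amplitude(temporal_gate(absorbed, t, t_cut))

print(f"reference, ungated: {ref_full:.3e}")
print(f"reference, gated:   {ref_gated:.3e}")
print(f"absorbed, gated:    {abs_gated:.3e}")
```

In this toy case the gated reference is almost entirely suppressed, while the gated absorbed signal retains most of the FID tail, so the absorbed amplitude exceeds the reference amplitude after filtering.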

Comparison
The trendlines between relative sample concentration and SNR for all three schemes are depicted in Supplementary Fig. 22.
Each data point denotes the highest possible SNR that can be obtained at that concentration. When the SNR is greater than or equal to 1, we assume the sample (absorption) is detectable. Otherwise, it is assumed undetectable, as it would be hard to distinguish the spectral difference from noise. The abscissa (concentration) of the intersection between the line of SNR=1 and each curve can be understood as its sensitivity (minimum detectable concentration, MDC).
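The SNR definition and the reading of the MDC from the SNR=1 crossing can be sketched as follows. This is a toy example under stated assumptions: a flat reference spectrum, a single Lorentzian absorption line, a constant noise amplitude, and Beer-Lambert scaling with concentration. The function names and all numbers are hypothetical.

```python
import numpy as np

def spectral_snr(reference, absorbed, noise):
    """Ratio of the mean difference-spectrum amplitude to the mean noise amplitude."""
    return np.mean(np.abs(reference - absorbed)) / np.mean(np.abs(noise))

def min_detectable_concentration(conc, snr):
    """Concentration at which the SNR curve crosses 1 (log-log interpolation).

    Assumes the SNR increases monotonically with concentration.
    """
    return 10 ** np.interp(0.0, np.log10(snr), np.log10(conc))

# Toy spectra (placeholders): flat reference, one Lorentzian line, flat noise.
nu = np.linspace(-1.0, 1.0, 401)
line = 0.2 / (1.0 + (nu / 0.05) ** 2)  # power absorbance at unit concentration
reference = np.ones_like(nu)
noise = np.full_like(nu, 1e-3)

conc = np.logspace(-4, 2, 61)          # relative concentration
snr_curve = np.array([
    spectral_snr(reference, reference * np.exp(-c * line / 2.0), noise)
    for c in conc
])

max_snr = np.mean(reference) / np.mean(noise)  # upper limit of the SNR
mdc = min_detectable_concentration(conc, snr_curve)
print(f"max SNR = {max_snr:.0f}, MDC = {mdc:.3f}")
```

The curve saturates below `max_snr` at high concentration and falls linearly at low concentration, so a higher `max_snr` shifts the SNR=1 crossing to a lower concentration.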
Let us first discuss DCS and CCS without temporal filtering. Although a different sample concentration leads to a different absorbed measurement, the reference measurements, which already saturate the detectors, do not change. Therefore, neither increasing nor decreasing the optical power would further improve the SNR at any concentration, so we keep the same power setting for the SNR estimation at all concentrations. In other words, each data point in the curves denotes the highest SNR obtainable at the given concentration, which is limited by detector saturation rather than by the optical power. Compared to the unit concentration, a higher concentration can lead to a higher SNR because of a larger difference spectrum, until the SNR approaches its upper limit, as explained in Section 5.2. A lower concentration decreases the SNR all the way toward zero, with an SNR<1 regarded as undetectable. Since the SNR decreases at a fixed rate, the sensitivity (MDC) is determined by the max SNR, which fundamentally depends on the detector. In both cases, the highest SNR is limited by detector saturation, as is the sensitivity. For MIR CCS with temporal filtering, while its highest SNR is still limited by detector saturation, its MDC is fundamentally limited by the strength of the nonlinearity, which determines where the SNR starts to decrease with the concentration.
Let us then discuss CCS with temporal filtering, which is slightly different. For a concentration higher than the unit concentration, the SNR cannot notably increase despite a stronger FID tail, because the detector is already set to be saturated in the absorbed measurement at the unit concentration, and the spectral amplitude of the reference measurement is already very low. For a lower concentration, if we keep the same optical power and cutoff delay, the FID tail will get weaker, so the SNR will decrease.
However, if we can apply a higher power (target or local) to compensate for the lower absorption, we can keep the amplitude of the FID signal the same and still saturate the detector. As such, the SNR is estimated to be the same as that at the unit concentration, which explains the part of the plateau extending to concentrations smaller than 10^0. This plateau can be maintained until there is no more optical power (upconversion capability) available, after which the SNR will start to decrease with the sample concentration in a way similar to the other two cases. Therefore, unlike the other two cases, the sensitivity (MDC, the intersection) here is determined by the highest upconversion capability, which is decided by the target power, local power, and nonlinear platform together, as claimed and discussed in the main paper. In this simulation, we assume the availability of roughly a factor of 10 higher nonlinear upconversion strength compared to the parameters used at the unit concentration, which keeps the SNR from decreasing down to a concentration as low as 10^-1 (the turning point). The practical values of the turning point and sensitivity will depend on the specific experimental conditions, for which a more accurate estimation would require a more accurate model of the nonlinear conversion process.
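The plateau and turning point described above can be captured by a deliberately simplified scaling model. The sketch below assumes that the FID amplitude scales linearly with both the concentration and the applied upconversion gain (weak-absorption regime), and that the SNR is capped at its plateau value by detector saturation. The function names and the plateau SNR of 100 are hypothetical placeholders. Only the factor-of-10 gain reserve is taken from the text.

```python
def gated_ccs_snr(conc, snr_plateau, max_gain):
    """Toy SNR of CCS with temporal filtering versus relative concentration.

    The upconversion gain is raised to compensate for a lower concentration
    until the reserve `max_gain` is exhausted; beyond that point the SNR
    falls linearly with concentration.
    """
    return snr_plateau * min(1.0, max_gain * conc)

def gated_ccs_mdc(snr_plateau, max_gain):
    """Concentration at which the toy SNR curve crosses 1."""
    return 1.0 / (max_gain * snr_plateau)

snr_plateau = 100.0  # hypothetical plateau SNR at the unit concentration
max_gain = 10.0      # factor-of-10 upconversion reserve assumed in the text

turning_point = 1.0 / max_gain
for c in (10.0, 1.0, turning_point, 0.01):
    print(f"c = {c:g}: SNR = {gated_ccs_snr(c, snr_plateau, max_gain):g}")
print(f"MDC = {gated_ccs_mdc(snr_plateau, max_gain):g}")
```

In this picture the MDC improves in direct proportion to the available upconversion reserve, which is the sense in which the sensitivity is set by the upconversion capability rather than by detector saturation.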
In summary, in this section, we demonstrate that the MIR CCS (without temporal filtering) can have a higher SNR and sensitivity compared to the MIR DCS, thanks to the advantages of NIR detectors and smaller noises due to the reduced background signal. In both cases, we show their SNR and sensitivity are fundamentally limited by the detector saturation and noise if high enough optical power is used, under our assumptions of the noise sources. Moreover, CCS with temporal filtering can provide even higher SNR and sensitivity because of its different detection methodology, which will be fundamentally decided by the upconversion capability instead of detector saturation. However, unlike the other two methods, it cannot provide